Group Members: LI Jiaying, ZHANG Xinyi, ZHANG Renzhi

Abstract

In this project, we propose Gradient Boosting without feature engineering as the most accurate model, and a Decision Tree as a balanced trade-off between interpretability and accuracy.

Import Libraries

In [0]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
!pip install scorecardpy
import scorecardpy as sc
import matplotlib.pyplot as plt
# show plots automatically
%matplotlib inline

from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer,IterativeImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
!pip install pygam
from pygam import LogisticGAM,f,s
from sklearn.linear_model import LinearRegression
from patsy import dmatrix
import warnings
warnings.filterwarnings("ignore")
Successfully installed scorecardpy-0.1.9.1.1
Successfully installed pygam-0.8.0

Load Data

In [0]:
from google.colab import drive
drive.mount('/content/drive')
In [0]:
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/HELOC/HelocData.csv')
In [0]:
data_dict = pd.read_csv("/content/drive/My Drive/Colab Notebooks/HELOC/B.csv")
var_names = data_dict.iloc[:,1]
var_names.name = 'Description'
var_names.to_frame()
Out[0]:
Description
0 Paid as negotiated flag (12-36 Months). String...
1 Consolidated version of risk markers
2 Months Since Oldest Trade Open
3 Months Since Most Recent Trade Open
4 Average Months in File
5 Number Satisfactory Trades
6 Number Trades 60+ Ever
7 Number Trades 90+ Ever
8 Percent Trades Never Delinquent
9 Months Since Most Recent Delinquency
10 Max Delq/Public Records Last 12 Months. See ta...
11 Max Delinquency Ever. See tab "MaxDelq" for ea...
12 Number of Total Trades (total number of credit...
13 Number of Trades Open in Last 12 Months
14 Percent Installment Trades
15 Months Since Most Recent Inq excl 7days
16 Number of Inq Last 6 Months
17 Number of Inq Last 6 Months excl 7days. Exclud...
18 Net Fraction Revolving Burden. This is revolvi...
19 Net Fraction Installment Burden. This is insta...
20 Number Revolving Trades with Balance
21 Number Installment Trades with Balance
22 Number Bank/Natl Trades w high utilization ratio
23 Percent Trades with Balance

Data Preprocessing

1. Data Cleaning

Missing Values

In [0]:
import matplotlib.pyplot as plt
psg_num = df.shape[0]
missingdata = df[(df == -7) | (df == -8) | (df == -9)].count()/psg_num
print((missingdata).apply(lambda x: format(x, '.2%')))
missingdata.plot.bar(x = 'missing data', y = missingdata, rot = 0, figsize = (13, 5))
plt.show()
RiskFlag     0.00%
x1           5.72%
x2           7.91%
x3           5.62%
x4           5.62%
x5           5.62%
x6           5.62%
x7           5.62%
x8           5.62%
x9          51.90%
x10          5.62%
x11          5.62%
x12          5.62%
x13          5.62%
x14          5.62%
x15         27.91%
x16          5.62%
x17          5.62%
x18          7.40%
x19         38.31%
x20          7.11%
x21         13.85%
x22         11.20%
x23          5.79%
dtype: object

Taking a closer look at the missing values, we found 588 samples in which every variable equals -9 (missing).

We believe these rows carry no information and should be deleted at the outset.
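The all-missing check can be demonstrated on a toy frame (synthetic values standing in for the HELOC data, not the real file): a row carries no information when every feature column equals the missing code -9.

```python
import pandas as pd

# toy stand-in for the HELOC frame (synthetic values, not the real data)
toy = pd.DataFrame({
    'RiskFlag': ['Bad', 'Good', 'Bad'],
    'x1': [-9, 75, 66],
    'x2': [-9, 169, -9],
})

# a row is fully missing when every feature column (everything after RiskFlag) is -9
all_missing = (toy.iloc[:, 1:] == -9).all(axis=1)
print(int(all_missing.sum()))   # on the real data this count is 588; here it is 1
toy = toy.loc[~all_missing]
```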

In [0]:
for index, row in df.iterrows():
    # flag rows where every feature equals the missing code -9
    if (row[1:] == -9).all():
        df.at[index,'x24'] = np.nan
    else:
        df.at[index,'x24'] = 1
df.dropna(inplace=True)       # drop the fully-missing rows
df = df.iloc[:,:-1]           # remove the helper column x24
# recode the remaining special codes as NaN for later imputation
df.replace([-7, -8, -9], np.nan, inplace=True)
df.shape
df.head()
Out[0]:
RiskFlag x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23
0 Bad 75.0 169.0 2 59 21 0 0 100 NaN 7 8 22 4 36 NaN 4 4 43.0 112.0 4.0 6.0 0.0 83.0
1 Bad 66.0 502.0 4 145 34 0 0 97 36.0 6 6 37 4 27 4.0 3 3 80.0 53.0 17.0 3.0 12.0 83.0
2 Good 69.0 338.0 2 62 22 0 0 96 12.0 6 6 23 3 35 0.0 4 4 25.0 100.0 3.0 2.0 1.0 45.0
3 Good 75.0 422.0 1 91 55 0 0 100 NaN 7 8 57 4 33 0.0 4 4 2.0 11.0 12.0 2.0 1.0 57.0
4 Bad 63.0 242.0 2 68 25 0 0 100 NaN 7 8 26 1 19 NaN 3 3 73.0 NaN 12.0 1.0 5.0 87.0

2. Variable Selection

  • From the graph above, two variables contain over 38% missing values, namely x9 (51.90%) and x19 (38.31%).

    We believe these two variables should be deleted up front to improve the information density of the data.

  • We then performed a rough variable selection by fitting three black-box models (a multi-layer perceptron (MLP) classifier, a support vector classifier, and gradient boosting machines) on the remaining 21 variables. Based on post-hoc variable-importance (VI) analysis, nine variables (x6, x7, x10, x11, x13, x16, x17, x21, x22) were collectively judged to be of limited importance in all three models.

    We therefore dropped these nine variables to increase the sparsity of the model.
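The VI screening described above can be sketched with scikit-learn's permutation importance (a minimal sketch on synthetic data, not the HELOC set; `permutation_importance` requires scikit-learn >= 0.22, while the notebook later uses eli5's `PermutationImportance` for the same idea; in the project the check was repeated for the MLP and SVC models as well):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

# synthetic stand-in for the 21 remaining features
X, y = make_classification(n_samples=400, n_features=21, n_informative=6,
                           random_state=0)

gbm = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)
vi = permutation_importance(gbm, X, y, n_repeats=5, random_state=0)

# features whose mean importance is near zero are candidates for dropping
ranking = np.argsort(vi.importances_mean)[::-1]
low_vi = np.where(vi.importances_mean < 1e-3)[0]
```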

In [0]:
# drop the high-missingness features (x9, x19) and the low-importance features (x6, x7, x10, x11, x13, x16, x17, x21, x22)
df_tmp = df.copy()

df_tmp.drop(['x6','x7','x10','x11','x13','x16','x17','x21','x22','x9','x19'],
            axis = 1, inplace = True)

df = df_tmp

3. Data Split

In [0]:
# Split the data into training and testing sets with UID as the random seed
np.random.seed(20190016) 
df_train, df_test = train_test_split(df, test_size=0.2)

train_size = df_train.shape[0]/df.shape[0]
test_size = df_test.shape[0]/df.shape[0]
print('Proportion of training set: {:.0%}'.format(train_size))
print('Proportion of testing set: {:.0%}'.format(test_size))
Proportion of training set: 80%
Proportion of testing set: 20%

4. Imputation

In [0]:
# encode the label and impute missing values
def df_preprocesser(df, II=None, set='train'):
    df_tmp = df.copy()
    # encode RiskFlag as 0/1
    label_encoder = LabelEncoder()
    df_tmp['RiskFlag'] = label_encoder.fit_transform(df_tmp['RiskFlag'])

    # fit the IterativeImputer on the training set only
    if set=='train':
        II = IterativeImputer().fit(df_tmp)
    df_tmp[:] = II.transform(df_tmp)
    
    return df_tmp, II
In [0]:
# Fit the Iterative Imputer on the training set and pass the fitted imputer to the testing set.
df_train_clean, II= df_preprocesser(df_train, set='train')
df_test_clean= df_preprocesser(df_test,II, set='test')[0]
In [0]:
# After handling missing values and dropping unrelated variables, the data frame looks like:
df_train_clean.head()
Out[0]:
RiskFlag x1 x2 x3 x4 x5 x8 x12 x14 x15 x18 x20 x23
8601 0.0 76.0 194.0 19.0 132.0 15.0 94.0 16.0 31.0 0.0 0.0 1.0 45.0
2897 0.0 55.0 89.0 18.0 45.0 13.0 87.0 15.0 13.0 0.0 37.0 3.0 57.0
1156 1.0 86.0 439.0 2.0 112.0 18.0 100.0 21.0 57.0 1.0 26.0 3.0 83.0
2920 1.0 63.0 232.0 2.0 103.0 14.0 88.0 17.0 6.0 1.0 11.0 1.0 17.0
6563 1.0 65.0 155.0 1.0 89.0 6.0 78.0 9.0 33.0 0.0 50.0 1.0 67.0

5. Feature Engineering

I. Scaling
II. Choose one preprocessor from:

  • Preprocessor A: Piecewise Linear
  • Preprocessor B: B Spline
  • Preprocessor C: Binning
  • Preprocessor D: Raw

Exploratory Data Analysis

Data Distribution

According to the graphs drawn below, some variables are distributed differently depending on a Good or Bad RiskFlag. Scaling them all together could obscure that relationship.

Therefore, we selected only three variables suitable for scaling: x3, x8, x20.

In [0]:
import matplotlib.pyplot as plt

fig, sub = plt.subplots(12, 1,figsize=(10,100))
df_train = df_train_clean
orders = df_train.iloc[:,1:].columns

for ax ,order in zip(sub.flatten(),orders):
    ax.hist(df_train[order][df_train['RiskFlag']==0],bins=20, label='0', density=True, alpha=0.75, color='red')
    ax.hist(df_train[order][df_train['RiskFlag']==1],bins=20, label='1', density=True, alpha=0.75, color='green')
    
    ax.legend(loc="upper right")
    ax.set_xlabel(order)

for i, order in enumerate(orders):
    df_train[order][df_train['RiskFlag']==0].plot(kind='density', color='red', label='', ax=sub[i])
    df_train[order][df_train['RiskFlag']==1].plot(kind='density', color='green', label='', ax=sub[i])
plt.savefig('Distribution.png')
plt.show()

Variable Correlation

There are two pairs of variables highly correlated with each other: (x2, x4) and (x12, x5), each with correlation above 0.65.

We propose that these four variables should not be binned; nevertheless, we still applied binning below to check its effect, rather than removing them outright.

In [0]:
f = plt.figure(figsize=(19, 15))
plt.matshow(df.corr(), fignum=f.number)
plt.xticks(range(df.shape[1]), df.columns, fontsize=14, rotation=45)
plt.yticks(range(df.shape[1]), df.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Correlation Matrix', fontsize=16);
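The highly correlated pairs read off the heatmap can also be extracted programmatically. A minimal sketch on a toy frame (synthetic columns; in the notebook `df` holds the cleaned HELOC features):

```python
import numpy as np
import pandas as pd

# toy frame with one deliberately correlated pair
rng = np.random.default_rng(0)
base = rng.normal(size=300)
toy = pd.DataFrame({
    'x2': base,
    'x4': base + rng.normal(scale=0.3, size=300),  # strongly tied to x2
    'x5': rng.normal(size=300),                    # independent
})

corr = toy.corr().abs()
# keep only the upper triangle so each pair appears once, without the diagonal
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = [(r, c, round(upper.loc[r, c], 2))
         for r in upper.index for c in upper.columns
         if pd.notna(upper.loc[r, c]) and upper.loc[r, c] > 0.65]
print(pairs)
```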

Scaling

In [0]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
for i in ['x3','x8','x20']:
  # fit on the training column, then apply the same scaling to the test column
  df_train_clean[i] = scaler.fit_transform(df_train_clean[[i]])
  df_test_clean[i] = scaler.transform(df_test_clean[[i]])

Piecewise Linear

In [0]:
# Piecewise ReLU with K knots (number of added features per variable)
def df_PieceReLU(df, tau=None, set='train'):
    df_tmp = df.copy()
    name = df.name
    if set=='train':
        # place K knots evenly between the min and max of the training column
        K = 4
        tau = np.linspace(df_tmp.min(), df_tmp.max(), K+2)[1:-1]

    # build the basis: the raw column plus one ReLU feature per knot
    xphi = df_tmp
    for k in range(len(tau)):
        tmp = [max(x1-tau[k], 0) for x1 in df_tmp]
        xphi = np.column_stack((xphi, tmp))

    df_tmp = pd.DataFrame(xphi)
    df_tmp.drop(0, axis = 1, inplace = True)   # drop the raw column, keep the ReLU pieces
    df_tmp.columns = [name+'_'+str(i+1) for i in range(df_tmp.shape[1])]

    return df_tmp, tau
In [0]:
# Apply the piecewise ReLU transform feature by feature: fit the knots (tau)
# on each training column, then reuse them on the matching test column
df_train_X_PieceReLU_integrated=pd.DataFrame()
df_test_X_PieceReLU_integrated=pd.DataFrame()

for column in df_train_clean.columns[1:]:
    df_train_PieceReLU, tau= df_PieceReLU(df_train_clean[column], set='train')
    df_test_PieceReLU= df_PieceReLU(df_test_clean[column],tau, set='test')[0]
    df_train_X_PieceReLU_integrated = pd.concat([df_train_X_PieceReLU_integrated,df_train_PieceReLU],axis=1)
    df_test_X_PieceReLU_integrated = pd.concat([df_test_X_PieceReLU_integrated,df_test_PieceReLU],axis=1)

x_train_PieceReLU = df_train_X_PieceReLU_integrated.iloc[:,1:].values
x_test_PieceReLU = df_test_X_PieceReLU_integrated.iloc[:,1:].values

feature_names_PieceReLU=df_train_X_PieceReLU_integrated.columns[1:].values

B Spline

In [0]:
def B_Spline(df):
    df_tmp1 = df.copy()
    df_tmp2 = df.copy()
    from patsy import dmatrix
    # expand each feature into a 4-column quadratic B-spline basis
    for column in df_tmp1.columns[1:]:
        column_bs = dmatrix('bs(x, df=4, degree=2)-1', {'x': df_tmp1[column]}, return_type='dataframe')
        df_tmp2 = pd.concat([df_tmp2, column_bs], axis=1)
        df_tmp2.drop(column, axis=1, inplace=True)
    # rename the basis columns x2_1 ... x2_4, x3_1 ... so names stay unique
    lst = [x+'_'+str(i+1) for x in df.columns[1:] for i in range(4)]
    lst.insert(0,'RiskFlag')
    df_tmp2.columns = lst
    return df_tmp2

df_train_BSpline = B_Spline(df_train_clean)
df_test_BSpline = B_Spline(df_test_clean)

x_train_BSpline = df_train_BSpline.iloc[:,1:].values
x_test_BSpline = df_test_BSpline.iloc[:,1:].values

feature_names_BSpline=df_train_BSpline.columns[1:].values

IV Binning

In [0]:
import scorecardpy as sc
from sklearn.preprocessing import OneHotEncoder

def IVBinning(df_train_clean):
  # df_train_clean
  df_train0 = df_train_clean
  df_X_train0 = df_train_clean.iloc[:,1:]
  # obtain IV bins with training set info
  bins = sc.woebin(df_train0, y='RiskFlag', method='tree')

  # bin each column in training set and combine together
  df_X_train1_IV = df_X_train0.copy()
  for _, col_name in enumerate(df_X_train0.columns):
      breaks = bins[col_name]['breaks'].values.astype(np.float)
      breaks = np.insert(breaks,0,-np.inf)
      
      # set `labels=False` to get binned categorical columns encoded as integer values
      df_X_train1_IV[col_name] = pd.cut(df_X_train0[col_name], bins=breaks, right=True, labels=False)


  # apply the same training-set bins to the testing set
  # (re-fitting sc.woebin on the test set would produce mismatched bins)
  df_X_test0 = df_test_clean.iloc[:,1:]
  df_X_test1_IV = df_X_test0.copy()
  for _, col_name in enumerate(df_X_test0.columns):
      breaks = bins[col_name]['breaks'].values.astype(np.float)
      breaks = np.insert(breaks,0,-np.inf)
      
      # set `labels=False` to get binned categorical columns encoded as integer values
      df_X_test1_IV[col_name] = pd.cut(df_X_test0[col_name], bins=breaks, right=True, labels=False)
      
  enc = OneHotEncoder(handle_unknown='ignore')
  X_train_encoded = enc.fit_transform(df_X_train1_IV.values)
  X_test_encoded = enc.transform(df_X_test1_IV.values)
  feature_names = enc.get_feature_names(df_train_clean.iloc[:,1:].columns)
  return (X_train_encoded,X_test_encoded,feature_names)
  
(x_train_BinAll,x_test_BinAll,feature_names_BinAll) = IVBinning(df_train_clean)
[INFO] creating woe binning ...
Binning on 7896 rows and 13 columns in 00:00:16

Raw Data

  • No Feature Engineering (Post Imputing, Pre Scaling)
In [0]:
x_train_raw = df_train_clean.iloc[:,1:].values
y_train = df_train_clean.iloc[:,0].values

x_test_raw = df_test_clean.iloc[:,1:].values
y_test = df_test_clean.iloc[:,0].values

feature_names_raw = df_test_clean.columns[1:].values

Model Fitting

In [0]:
from sklearn.metrics import accuracy_score
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
from sklearn.tree import export_graphviz
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.inspection import plot_partial_dependence as pdp
!pip install pandasql
from pandasql import sqldf

!pip install eli5
import eli5
from eli5.sklearn import PermutationImportance
Successfully installed pandasql-0.7.3
Successfully installed eli5-0.10.1

Dataset naming

  • A: x_train_PieceReLU, x_test_PieceReLU, feature_names_PieceReLU
  • B: x_train_BSpline, x_test_BSpline, feature_names_BSpline
  • C: x_train_BinAll, x_test_BinAll, feature_names_BinAll
  • D: x_train_raw, x_test_raw, feature_names_raw

Model 1: Generalized Additive Model

(a) Piecewise ReLU

In [0]:
from pygam import LogisticGAM, s

# build piecewise linear spline
n_splines = 4

k = s(0, n_splines=n_splines, spline_order=1)
for i in range(1,12):
    k += s(i, n_splines=n_splines, spline_order=1)
print(k)

p_spl = LogisticGAM(k)
p_spl.gridsearch(x_train_raw, y_train) 

y_pred_train = p_spl.predict(x_train_raw)
y_pred_test = p_spl.predict(x_test_raw)

print('The Acc on training set:',accuracy_score(y_train,y_pred_train).round(4))
print('The Acc on testing set:',accuracy_score(y_test,y_pred_test).round(4))
N/A% (0 of 11) |                         | Elapsed Time: 0:00:00 ETA:  --:--:--
s(0) + s(1) + s(2) + s(3) + s(4) + s(5) + s(6) + s(7) + s(8) + s(9) + s(10) + s(11)
100% (11 of 11) |########################| Elapsed Time: 0:00:03 Time:  0:00:03
The Acc on training set: 0.7421
The Acc on testing set: 0.7256

(b) B Spline *

In [0]:
from pygam import LogisticGAM, s

# build a cubic B-spline basis (pygam's default spline order)
n_splines = 4

k = s(0, n_splines=n_splines)
for i in range(1,12):
    k += s(i, n_splines=n_splines)
print(k)

p_spl = LogisticGAM(k)
p_spl.gridsearch(x_train_raw, y_train) 

y_pred_train = p_spl.predict(x_train_raw)
y_pred_test = p_spl.predict(x_test_raw)

print('The Acc on training set:',accuracy_score(y_train,y_pred_train).round(4))
print('The Acc on testing set:',accuracy_score(y_test,y_pred_test).round(4))
N/A% (0 of 11) |                         | Elapsed Time: 0:00:00 ETA:  --:--:--
s(0) + s(1) + s(2) + s(3) + s(4) + s(5) + s(6) + s(7) + s(8) + s(9) + s(10) + s(11)
100% (11 of 11) |########################| Elapsed Time: 0:00:03 Time:  0:00:03
The Acc on training set: 0.743
The Acc on testing set: 0.7271

(c) Binning

In [0]:
from sklearn.preprocessing import KBinsDiscretizer
# fit the binner with training set info
# set `encode='ordinal'` instead of 'onehot-dense' to get binned categorical columns encoded as integer values

df_train_tmp = df_train_clean.iloc[:,1:]
df_test_tmp = df_test_clean.iloc[:,1:]

KBD = KBinsDiscretizer(encode='ordinal', strategy='uniform')
KBD.fit(df_train_tmp)

# transform the training and testing sets with fitted binner
df_X_train1_KBD = df_train_tmp.copy()
df_X_train1_KBD[:] = KBD.transform(df_train_tmp)

df_X_test1_KBD = df_test_tmp.copy()
df_X_test1_KBD[:] = KBD.transform(df_test_tmp) 

df_X_train1_KBD.head()
Out[0]:
x1 x2 x3 x4 x5 x8 x12 x14 x15 x18 x20 x23
8601 2.0 1.0 0.0 1.0 0.0 4.0 0.0 1.0 0.0 0.0 1.0 1.0
2897 1.0 0.0 0.0 0.0 0.0 4.0 0.0 0.0 0.0 1.0 1.0 2.0
1156 3.0 2.0 0.0 1.0 1.0 4.0 1.0 2.0 0.0 1.0 1.0 3.0
2920 1.0 1.0 0.0 1.0 0.0 4.0 0.0 0.0 0.0 0.0 1.0 0.0
6563 2.0 0.0 0.0 1.0 0.0 3.0 0.0 1.0 0.0 1.0 1.0 2.0
In [0]:
from pygam import LogisticGAM,f,s

k = f(0)
for i in range(1,12):
    k += f(i)
print(k)

gam1_KBD = LogisticGAM(k)
gam1_KBD.gridsearch(df_X_train1_KBD.values, y_train)

y_pred_train = gam1_KBD.predict(df_X_train1_KBD.values)
y_pred_test = gam1_KBD.predict(df_X_test1_KBD.values)

print('The Acc on training set:',accuracy_score(y_train,y_pred_train).round(4))
print('The Acc on testing set:',accuracy_score(y_test,y_pred_test).round(4))
N/A% (0 of 11) |                         | Elapsed Time: 0:00:00 ETA:  --:--:--
f(0) + f(1) + f(2) + f(3) + f(4) + f(5) + f(6) + f(7) + f(8) + f(9) + f(10) + f(11)
100% (11 of 11) |########################| Elapsed Time: 0:00:03 Time:  0:00:03
The Acc on training set: 0.7172
The Acc on testing set: 0.7114

(d) Raw Data

In [0]:
from pygam import LogisticGAM, s

# build a constant (order-0) spline with a single basis function per feature, as a raw-data baseline
n_splines = 1

k = s(0, n_splines=n_splines, spline_order=0)
for i in range(1,12):
    k += s(i, n_splines=n_splines, spline_order=0)
print(k)

p_spl = LogisticGAM(k)
p_spl.gridsearch(x_train_raw, y_train) 

y_pred_train = p_spl.predict(x_train_raw)
y_pred_test = p_spl.predict(x_test_raw)

print('The Acc on training set:',accuracy_score(y_train,y_pred_train).round(4))
print('The Acc on testing set:',accuracy_score(y_test,y_pred_test).round(4))
  9% (1 of 11) |##                       | Elapsed Time: 0:00:00 ETA:   0:00:01
s(0) + s(1) + s(2) + s(3) + s(4) + s(5) + s(6) + s(7) + s(8) + s(9) + s(10) + s(11)
100% (11 of 11) |########################| Elapsed Time: 0:00:01 Time:  0:00:01
The Acc on training set: 0.5171
The Acc on testing set: 0.5332

Model 2: Decision Tree

(a) Piecewise ReLU

In [0]:
# fit classification tree with depth 3
from sklearn.tree import DecisionTreeClassifier
dt_clf1 = DecisionTreeClassifier(max_depth=3)
dt_clf1.fit(x_train_PieceReLU, y_train)

y_pred_train = dt_clf1.predict(x_train_PieceReLU)
y_pred_test = dt_clf1.predict(x_test_PieceReLU)

print('The Acc on training set:',accuracy_score(y_train,y_pred_train).round(4))
print('The Acc on testing set:',accuracy_score(y_test,y_pred_test).round(4))
The Acc on training set: 0.712
The Acc on testing set: 0.6967
In [0]:
## GridSearchCV
from sklearn.model_selection import GridSearchCV
tuned_parameters={'min_samples_split' : range(10,500,20),'max_depth': range(1,20,2)}
clf=GridSearchCV(DecisionTreeClassifier(),tuned_parameters,scoring='accuracy',cv=5,return_train_score=True)
clf.fit(x_train_PieceReLU, y_train)

# print the best parameters
print('Best Parameters:',clf.best_params_,'\n',
      "Training accuracy for gb_clf:",accuracy_score(y_train, clf.predict(x_train_PieceReLU)).round(4),'\n',
      "Testing accuracy for gb_clf:",accuracy_score(y_test, clf.predict(x_test_PieceReLU)).round(4))
Best Parameters: {'max_depth': 5, 'min_samples_split': 450} 
 Training accuracy for gb_clf: 0.723 
 Testing accuracy for gb_clf: 0.7013

(b) B Spline

In [0]:
# fit classification tree with depth 3
from sklearn.tree import DecisionTreeClassifier
dt_clf1 = DecisionTreeClassifier(max_depth=3)
dt_clf1.fit(x_train_BSpline, y_train)

y_pred_train = dt_clf1.predict(x_train_BSpline)
y_pred_test = dt_clf1.predict(x_test_BSpline)

print('The Acc on training set:',accuracy_score(y_train,y_pred_train).round(4))
print('The Acc on testing set:',accuracy_score(y_test,y_pred_test).round(4))
The Acc on training set: 0.7466
The Acc on testing set: 0.6942
In [0]:
## GridSearchCV
from sklearn.model_selection import GridSearchCV
tuned_parameters={'min_samples_split' : range(10,500,20),'max_depth': range(1,20,2)}
clf=GridSearchCV(DecisionTreeClassifier(),tuned_parameters,scoring='accuracy',cv=5,return_train_score=True)
clf.fit(x_train_BSpline, y_train)

# print the best parameters
print('Best Parameters:',clf.best_params_,'\n',
      "Training accuracy for gb_clf:",accuracy_score(y_train, clf.predict(x_train_BSpline)).round(4),'\n',
      "Testing accuracy for gb_clf:",accuracy_score(y_test, clf.predict(x_test_BSpline)).round(4))
Best Parameters: {'max_depth': 5, 'min_samples_split': 230} 
 Training accuracy for gb_clf: 0.757 
 Testing accuracy for gb_clf: 0.6992

(c) IV Binning

In [0]:
# fit classification tree with depth 3
from sklearn.tree import DecisionTreeClassifier
dt_clf1 = DecisionTreeClassifier(max_depth=3)
dt_clf1.fit(x_train_BinAll, y_train)

y_pred_train = dt_clf1.predict(x_train_BinAll)
y_pred_test = dt_clf1.predict(x_test_BinAll)

print('The Acc on training set:',accuracy_score(y_train,y_pred_train).round(4))
print('The Acc on testing set:',accuracy_score(y_test,y_pred_test).round(4))
The Acc on training set: 0.7225
The Acc on testing set: 0.7266
In [0]:
## GridSearchCV
from sklearn.model_selection import GridSearchCV
tuned_parameters={'min_samples_split' : range(10,500,20),'max_depth': range(1,20,2)}
clf=GridSearchCV(DecisionTreeClassifier(),tuned_parameters,scoring='accuracy',cv=5,return_train_score=True)
clf.fit(x_train_BinAll, y_train)

# print the best parameters
print('Best Parameters:',clf.best_params_,'\n',
      "Training accuracy for gb_clf:",accuracy_score(y_train, clf.predict(x_train_BinAll)).round(4),'\n',
      "Testing accuracy for gb_clf:",accuracy_score(y_test, clf.predict(x_test_BinAll)).round(4))
Best Parameters: {'max_depth': 13, 'min_samples_split': 150} 
 Training accuracy for gb_clf: 0.759 
 Testing accuracy for gb_clf: 0.7382

(d) Raw Data *

In [0]:
# fit classification tree with depth 3
from sklearn.tree import DecisionTreeClassifier
dt_clf1 = DecisionTreeClassifier(max_depth=3)
dt_clf1.fit(x_train_raw, y_train)

y_pred_train = dt_clf1.predict(x_train_raw)
y_pred_test = dt_clf1.predict(x_test_raw)

print('The Acc on training set:',accuracy_score(y_train,y_pred_train).round(4))
print('The Acc on testing set:',accuracy_score(y_test,y_pred_test).round(4))
The Acc on training set: 0.747
The Acc on testing set: 0.7357
In [0]:
## GridSearchCV
from sklearn.model_selection import GridSearchCV
tuned_parameters={'min_samples_split' : range(10,500,20),'max_depth': range(1,20,2)}
clf=GridSearchCV(DecisionTreeClassifier(),tuned_parameters,scoring='accuracy',cv=5,return_train_score=True)
clf.fit(x_train_raw, y_train)

# print the best parameters
print('Best Parameters:',clf.best_params_,'\n',
      "Training accuracy for gb_clf:",accuracy_score(y_train, clf.predict(x_train_raw)).round(4),'\n',
      "Testing accuracy for gb_clf:",accuracy_score(y_test, clf.predict(x_test_raw)).round(4))
Best Parameters: {'max_depth': 7, 'min_samples_split': 170} 
 Training accuracy for gb_clf: 0.7812 
 Testing accuracy for gb_clf: 0.7524

Model 3: GradientBoostingClassifier

In [0]:
#Fit the gradient boosting machines

def PrintGBAcc(X_train,X_test,y_train,y_test,feature_engineer):
  from sklearn.svm import SVC
  from sklearn.svm import LinearSVC
  gb_clf = GradientBoostingClassifier(learning_rate=0.1, max_depth=5, n_estimators=60)
  gb_clf.fit(X_train, y_train)
  gb_pred_train= gb_clf.predict(X_train)
  gb_pred_test= gb_clf.predict(X_test)
  print('Training accuracy for the gradient boosting machines is',feature_engineer,':', accuracy_score(y_train, gb_pred_train).round(4))
  print('Testing accuracy for the gradient boosting machines is',feature_engineer,':', accuracy_score(y_test, gb_pred_test).round(4))

PrintGBAcc(x_train_PieceReLU,x_test_PieceReLU,y_train,y_test,'PieceReLU')
PrintGBAcc(x_train_BSpline,x_test_BSpline,y_train,y_test,'BSpline')
PrintGBAcc(x_train_BinAll,x_test_BinAll,y_train,y_test,'IV Binning')
PrintGBAcc(x_train_raw,x_test_raw,y_train,y_test,'Raw Data')
Training accuracy for the gradient boosting machines is PieceReLU : 0.7794
Testing accuracy for the gradient boosting machines is PieceReLU : 0.7028
Training accuracy for the gradient boosting machines is BSpline : 0.8356
Testing accuracy for the gradient boosting machines is BSpline : 0.7195
Training accuracy for the gradient boosting machines is IV Binning : 0.7938
Testing accuracy for the gradient boosting machines is IV Binning : 0.7549
Training accuracy for the gradient boosting machines is Raw Data : 0.8294
Testing accuracy for the gradient boosting machines is Raw Data : 0.7742
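Instead of fixing `n_estimators=60` up front, `staged_predict` lets us watch test accuracy after every boosting round and pick the best stage. A sketch on synthetic stand-in data (the notebook's `x_train_raw`/`x_test_raw` are assumed here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for the notebook's raw train/test split.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(learning_rate=0.1, max_depth=5, n_estimators=60)
gb.fit(X_tr, y_tr)

# test accuracy after each of the 60 boosting stages
stage_acc = [accuracy_score(y_te, pred) for pred in gb.staged_predict(X_te)]
best_stage = int(np.argmax(stage_acc)) + 1
print("best number of estimators:", best_stage)
```

This is cheaper than re-fitting the model once per candidate `n_estimators`, since a single fit yields the whole accuracy curve.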
In [0]:
## Randomized search CV

from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV
# RandomizedSearchCV refits and returns the best model automatically.

# Hyper-parameter space.
tuned_parameters = {'max_depth': np.arange(4,9,1),
                    'learning_rate': uniform(loc=0.08,scale=0.04),
                    'n_estimators': np.arange(50,100,10)}

clf = RandomizedSearchCV(GradientBoostingClassifier(), tuned_parameters,scoring='accuracy',cv=5,n_iter=50)
clf.fit(x_train_raw, y_train)
# print the best parameters
print('Best Parameters:',clf.best_params_,'\n',
      "Training accuracy for gb_clf:",accuracy_score(y_train, clf.predict(x_train_raw)).round(4),'\n',
      "Testing accuracy for gb_clf:",accuracy_score(y_test, clf.predict(x_test_raw)).round(4))
Best Parameters: {'learning_rate': 0.09981653591378371, 'max_depth': 5, 'n_estimators': 90} 
 Training accuracy for gb_clf: 0.8451 
 Testing accuracy for gb_clf: 0.7742

Model 4: Support Vector Machine

LinearSVC and SVC-rbf

In [0]:
## Benchmark ACC. 0.7362 (Linear SVM)

def PrintSVCAcc(X_train,X_test,y_train,y_test,feature_engineer):
  from sklearn.svm import SVC
  from sklearn.svm import LinearSVC

  # fit the linear kernel model
  linSVC = LinearSVC(max_iter=10000)
  linSVC.fit(X_train,y_train)
  # show accuracy 
  print('Linear SVM Accuracy',feature_engineer,':',accuracy_score(y_test,linSVC.predict(X_test)).round(4))

  # fit the model with an RBF kernel (the degree parameter is ignored for 'rbf')
  rbfSVC = SVC(C=7,kernel='rbf')
  rbfSVC.fit(X_train,y_train)
  # show accuracy 
  print('RBF SVM Accuracy',feature_engineer,':',accuracy_score(y_test,rbfSVC.predict(X_test)).round(4))

PrintSVCAcc(x_train_PieceReLU,x_test_PieceReLU,y_train,y_test,'PieceReLU')
PrintSVCAcc(x_train_BSpline,x_test_BSpline,y_train,y_test,'BSpline')
PrintSVCAcc(x_train_BinAll,x_test_BinAll,y_train,y_test,'IV Binning')
PrintSVCAcc(x_train_raw,x_test_raw,y_train,y_test,'Raw Data')
Linear SVM Accuracy PieceReLU : 0.6208
RBF SVM Accuracy PieceReLU : 0.5894
Linear SVM Accuracy BSpline : 0.718
RBF SVM Accuracy BSpline : 0.7119
Linear SVM Accuracy IV Binning : 0.7387
RBF SVM Accuracy IV Binning : 0.7423
Linear SVM Accuracy Raw Data : 0.6916
RBF SVM Accuracy Raw Data : 0.5342
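SVMs are distance-based, so unscaled raw features can cripple the RBF kernel; the weak raw-data RBF accuracy above (0.5342) is consistent with this. A hedged sketch of one guard against it, wrapping the classifier in a Pipeline with `StandardScaler`; synthetic data with one deliberately oversized feature stands in for the raw credit data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data; one feature is put on a much larger scale, as can
# happen with raw credit attributes.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X[:, 0] *= 1000
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# scaling happens inside the pipeline, so CV folds are scaled consistently
scaled_svc = make_pipeline(StandardScaler(), SVC(C=7, kernel="rbf"))
scaled_svc.fit(X_tr, y_tr)
print("scaled RBF accuracy:", round(scaled_svc.score(X_te, y_te), 4))
```

Because the scaler lives inside the pipeline, the same object can be dropped straight into `RandomizedSearchCV` without leaking test-fold statistics into training.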
In [0]:
## Randomized search for the RBF SVC

from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV

## RandomSearch RBF
def PrintRandomSearchRBF(X_train,X_test,y_train,y_test,feature_engineering):
  tuned_parameters = {'kernel': ['rbf'],
                      'C': uniform(loc=3,scale=5),
                      'degree': [3]}
  clf = RandomizedSearchCV(SVC(), tuned_parameters, cv=5,n_iter=10,scoring='accuracy')
  clf.fit(X_train, y_train)
  # print the best parameters
  print('RBF - Best Parameters',feature_engineering,':',clf.best_params_,'\n',
        'RBF - Training accuracy',feature_engineering,':',accuracy_score(y_train, clf.predict(X_train)).round(4),'\n',
        'RBF - Testing accuracy for gb_clf',feature_engineering,':',accuracy_score(y_test, clf.predict(X_test)).round(4))

PrintRandomSearchRBF(x_train_BSpline ,x_test_BSpline ,y_train,y_test,'BSpline')

print('\n')

## Random Search LinearSVC
def PrintRandomSearchLinearSVC(X_train,X_test,y_train,y_test,feature_engineering):
  tuned_parameters = {'C': uniform(loc=3,scale=5),
                      'max_iter': uniform(loc=10000,scale=50000)}
  clf = RandomizedSearchCV(LinearSVC(), tuned_parameters, cv=5,n_iter=10,scoring='accuracy')
  clf.fit(X_train, y_train)
  # print the best parameters
  print('LinearSVC - Best Parameters',feature_engineering,':',clf.best_params_,'\n',
        'LinearSVC - Training accuracy',feature_engineering,':',accuracy_score(y_train, clf.predict(X_train)).round(4),'\n',
        'LinearSVC - Testing accuracy for gb_clf',feature_engineering,':',accuracy_score(y_test, clf.predict(X_test)).round(4))
PrintRandomSearchLinearSVC(x_train_BSpline ,x_test_BSpline ,y_train,y_test,'BSpline')
RBF - Best Parameters BSpline : {'C': 5.492484455047691, 'degree': 3, 'kernel': 'rbf'} 
 RBF - Training accuracy BSpline : 0.7397 
 RBF - Testing accuracy for gb_clf BSpline : 0.7144


LinearSVC - Best Parameters BSpline : {'C': 7.824124490611895, 'max_iter': 54158.235945119704} 
 LinearSVC - Training accuracy BSpline : 0.7409 
 LinearSVC - Testing accuracy for gb_clf BSpline : 0.7154

Model 5: Neural Network

(a) Piecewise ReLU

In [0]:
from sklearn.neural_network import MLPClassifier

# Hyper-parameter space.
tuned_parameters = {'solver': ['adam'],
                    'alpha': 10.0 ** -np.arange(-4, 4, 4),
                    'hidden_layer_sizes': np.arange(150, 300, 50),
                    'random_state': [0, 1, 2, 3]}

# RandomizedSearchCV automatically refits and returns the best model.
MLP_PieceReLU = RandomizedSearchCV(MLPClassifier(max_iter=1000), tuned_parameters, cv=5, n_iter=10,
                       scoring='accuracy', random_state=2019)
MLP_PieceReLU.fit(x_train_PieceReLU, y_train)

feature_names=feature_names_PieceReLU
# print the best parameters
print('Training accuracy for the MLPClassifier with Piecewise ReLU feature engineering is:', accuracy_score(y_train,MLP_PieceReLU.predict(x_train_PieceReLU)).round(4))
print('Testing accuracy for the MLPClassifier with Piecewise ReLU feature engineering is:', accuracy_score(y_test,MLP_PieceReLU.predict(x_test_PieceReLU)).round(4))
MLP_PieceReLU

(b) B Spline

In [0]:
## GridSearchCV
from sklearn.neural_network import MLPClassifier

# Hyper-parameter space.
tuned_parameters = {'solver': ['adam'],
                    'alpha': 10.0 ** -np.arange(-4, 4, 4),
                    'hidden_layer_sizes': np.arange(150, 300, 50),
                    'random_state': [0, 1, 2, 3]}


# GridSearchCV automatically returns the best model.
MLP_BSpline = GridSearchCV(MLPClassifier(max_iter=1000),tuned_parameters,scoring='accuracy',cv=5,return_train_score=True)
MLP_BSpline.fit(x_train_BSpline, y_train)

#feature_names=feature_name_BSpline
# print the best parameters
print('Training accuracy for the MLPClassifier with BSpline feature engineering is:', accuracy_score(y_train,MLP_BSpline.predict(x_train_BSpline)).round(4))
print('Testing accuracy for the MLPClassifier with BSpline feature engineering is:', accuracy_score(y_test,MLP_BSpline.predict(x_test_BSpline)).round(4))
MLP_BSpline

(c) IV Binning*

In [0]:
## GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Hyper-parameter space.
tuned_parameters = {'solver': ['adam'],
                    'alpha': 10.0 ** -np.arange(-4, 4, 4),
                    'hidden_layer_sizes': np.arange(150, 250, 50)}


# GridSearchCV automatically returns the best model.
MLP_BinAll = GridSearchCV(MLPClassifier(max_iter=1000),tuned_parameters,scoring='accuracy',cv=5,return_train_score=True)
MLP_BinAll.fit(x_train_BinAll, y_train)

#feature_names=feature_names_BinAll
# print the best parameters
print('Training accuracy for the MLPClassifier with BinAll feature engineering is:', accuracy_score(y_train,MLP_BinAll.predict(x_train_BinAll)).round(4))
print('Testing accuracy for the MLPClassifier with BinAll feature engineering is:', accuracy_score(y_test,MLP_BinAll.predict(x_test_BinAll)).round(4))
MLP_BinAll
Training accuracy for the MLPClassifier with BinAll feature engineering is: 0.7601
Testing accuracy for the MLPClassifier with BinAll feature engineering is: 0.7448
Out[0]:
GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=MLPClassifier(activation='relu', alpha=0.0001,
                                     batch_size='auto', beta_1=0.9,
                                     beta_2=0.999, early_stopping=False,
                                     epsilon=1e-08, hidden_layer_sizes=(100,),
                                     learning_rate='constant',
                                     learning_rate_init=0.001, max_iter=1000,
                                     momentum=0.9, n_iter_no_change=10,
                                     nesterovs_momentum=True, power_t=0.5,
                                     random_state=None, shuffle=True,
                                     solver='adam', tol=0.0001,
                                     validation_fraction=0.1, verbose=False,
                                     warm_start=False),
             iid='warn', n_jobs=None,
             param_grid={'alpha': array([1.e+04, 1.e+00]),
                         'hidden_layer_sizes': array([150, 200]),
                         'solver': ['adam']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='accuracy', verbose=0)

(d) Raw Data

In [0]:
## GridSearchCV

# Hyper-parameter space.
tuned_parameters = {'solver': ['adam'],
                    'alpha': 10.0 ** -np.arange(-4, 4, 4),
                    'hidden_layer_sizes': np.arange(150, 300, 50),
                    'random_state': [0, 1, 2, 3]}


# GridSearchCV automatically returns the best model.
MLP_Raw = GridSearchCV(MLPClassifier(max_iter=1000),tuned_parameters,scoring='accuracy',cv=5,return_train_score=True)
MLP_Raw.fit(x_train_raw, y_train)

#feature_names=feature_names_raw
# print the best parameters
print('Training accuracy for the MLPClassifier with Raw data is:', accuracy_score(y_train,MLP_Raw.predict(x_train_raw)).round(4))
print('Testing accuracy for the MLPClassifier with Raw data is:', accuracy_score(y_test,MLP_Raw.predict(x_test_raw)).round(4))
MLP_Raw

Interpretation

MLP + IV Binning

In [0]:
!pip install shap 
import shap
shap.initjs()
Collecting shap
  Downloading https://files.pythonhosted.org/packages/7c/e2/4050c2e68639adf0f8a8a4857f234b71fe1a1139e25ff17575c935f49615/shap-0.33.0.tar.gz (263kB)
     |████████████████████████████████| 266kB 4.9MB/s 
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from shap) (1.17.4)
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from shap) (1.3.3)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.6/dist-packages (from shap) (0.21.3)
Requirement already satisfied: pandas in /usr/local/lib/python3.6/dist-packages (from shap) (0.25.3)
Requirement already satisfied: tqdm>4.25.0 in /usr/local/lib/python3.6/dist-packages (from shap) (4.28.1)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-learn->shap) (0.14.1)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas->shap) (2018.9)
Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/dist-packages (from pandas->shap) (2.6.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.6.1->pandas->shap) (1.12.0)
Building wheels for collected packages: shap
  Building wheel for shap (setup.py) ... done
  Created wheel for shap: filename=shap-0.33.0-cp36-cp36m-linux_x86_64.whl size=382249 sha256=a619de06b55a5275d81b9a3924cc27a343d5dc699e345d7ef73e9505946821cf
  Stored in directory: /root/.cache/pip/wheels/39/0f/88/a8124d43431284e10f263ffe449e119344c6145c3a165d186c
Successfully built shap
Installing collected packages: shap
Successfully installed shap-0.33.0
In [0]:
def IntergrateSHAP(feature_names,feature_names_raw,shap_range,shap_values):
  feature_list = list(feature_names).copy()
  shap_integrated_all=[]
  for i in range(shap_range):
    shap_integrated=[]
    for column in feature_names_raw:
        shap_integrating = 0
        for feature_name in feature_names:
            if feature_name.startswith(column+'_'):
                feature_index = feature_list.index(feature_name)
                shap_integrating = shap_integrating + shap_values[i][feature_index]
        shap_integrated.append(shap_integrating)
    print(shap_integrated)
    shap_integrated_all.append(shap_integrated)
  return shap_integrated_all
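The aggregation step above can be checked on a toy example: each raw feature's binned dummy columns (named with a hypothetical `<raw>_<bin>` convention, as in the function's `startswith` test) have their SHAP values summed back into one value per raw feature.

```python
import numpy as np

# Hypothetical dummy-column names for two raw features, each split into two bins.
feature_names = ["x1_bin1", "x1_bin2", "x2_bin1", "x2_bin2"]
feature_names_raw = ["x1", "x2"]
shap_values = np.array([[0.1, 0.2, -0.3, 0.05]])  # one sample, four dummy columns

# Same aggregation as IntergrateSHAP: sum dummies back to their raw feature.
integrated = []
for sample in shap_values:
    row = [sum(v for name, v in zip(feature_names, sample)
               if name.startswith(col + "_"))
           for col in feature_names_raw]
    integrated.append(row)

print([[round(float(v), 6) for v in row] for row in integrated])  # [[0.3, -0.25]]
```

This works because SHAP values are additive, so summing a group of columns gives the group's joint contribution.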

def PlotSHAP(MLP_BinAll,x_train_BinAll,feature_names_BinAll,x_train_raw,feature_names_raw,shap_range):
  # define the explainer
  explainer = shap.KernelExplainer(MLP_BinAll.predict,shap.sample(x_train_BinAll, 50))
  # calculate the SHAP values on the training data
  # (approximate=True is passed through for speed where supported)
  shap_values = explainer.shap_values(x_train_BinAll[:shap_range], approximate=True)
  print(shap_values)
  

  ## shap_values_int = shap values after integration, it is a np.array with shape: (shap_range,12)
  ## feature_names = np.array(['x1','x2',...,'x23'])
  shap_values_int=np.array(IntergrateSHAP(feature_names_BinAll,feature_names_raw,shap_range,shap_values))
  feature_names=feature_names_raw

  shap.summary_plot(np.array(shap_values_int), feature_names=feature_names, plot_type='bar')
  shap.summary_plot(np.array(shap_values_int), feature_names=feature_names)
  
  # plot pdp
  for column in feature_names_raw:
    shap.dependence_plot(column, shap_values_int, x_train_raw[:shap_range], feature_names=feature_names)
  
  return (shap_values_int,feature_names)

(shap_values_int,feature_names) = PlotSHAP(MLP_BinAll,x_train_BinAll,feature_names_BinAll,x_train_raw,feature_names_raw,50)
[raw SHAP value array and the integrated per-feature SHAP lists printed for the first 50 training samples omitted for brevity]
In [0]:
# force_plot for one training sample X_i
shap.initjs()
i=1
shap.force_plot(explainer.expected_value, shap_values_int[i], x_train_raw[i], feature_names=feature_names)
Out[0]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
In [0]:
# force_plot for all training data
shap.initjs()
shap.force_plot(explainer.expected_value, np.array(shap_values_int), x_train_raw, feature_names=feature_names)
Out[0]:
Visualization omitted, Javascript library not loaded.

MLP Global interpretation

The graphs above suggest that the importance gap between features in this multi-layer perceptron (MLP) classifier is not dramatically large. All variables except x2 (Months Since Oldest Trade Open), x3 (Months Since Most Recent Trade Open), x12 (Number of Total Trades, i.e. total number of credit accounts), and x23 (Percent Trades with Balance) influence the model's prediction considerably, especially x15 (Months Since Most Recent Inquiry excl. 7 days) and x1 (consolidated version of risk markers).

From the partial dependence plots, we found that x1 (consolidated version of risk markers), x4 (Average Months in File), x5 (Number Satisfactory Trades), and x8 (Percent Trades Never Delinquent) are positively correlated with the probability of being marked 'Good' on the Risk Flag, while x14 (Percent Installment Trades), x18 (revolving balance divided by credit limit), and x20 (Number Revolving Trades with Balance) are negatively correlated with it. This agrees with our intuition and prior knowledge.


MLP Local interpretation

We are interested in how each feature pushes a sample's prediction from the SHAP base value to the final result. The graph above visualizes the effects for the second sample (index = 1) in the training dataset. The prediction for this sample is -0.26, indicating a high likelihood of loan default. Starting from the base value of 0.2, the SHAP contributions are driven towards the negative outcome ('Bad') mainly by x1 (consolidated version of risk markers), x15 (Months Since Most Recent Inquiry excl. 7 days), and x4 (Average Months in File). In other words, this customer is likely to default because of a short history in file and a relatively low score on the consolidated risk markers.
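The force plot rests on SHAP's additivity property: the base value plus a sample's SHAP values reproduces the model's output. For a linear model this has a closed form, phi_j = w_j * (x_j - E[x_j]), which makes the property easy to verify on toy data (the weights below are arbitrary illustrations, not fitted values):

```python
import numpy as np

# Toy linear "model": scores are a fixed weighted sum of the features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
w = np.array([1.5, -2.0, 0.5, 0.0])
f = X @ w

base_value = f.mean()                 # SHAP base value = mean prediction
phi = w * (X[1] - X.mean(axis=0))     # closed-form SHAP values for sample 1

# base value + sum of SHAP values recovers the model output for the sample
print(bool(np.isclose(base_value + phi.sum(), f[1])))  # True
```

KernelExplainer approximates the same decomposition for non-linear models like the MLP, which is why each arrow in the force plot is a per-feature contribution that sums to the prediction.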

GB + Raw

In [0]:
shap_range = 50
clf = GradientBoostingClassifier(learning_rate=0.1, max_depth=5, n_estimators=60)
clf.fit(x_train_raw, y_train)
# define the explainer
explainer = shap.KernelExplainer(clf.predict,shap.sample(x_train_raw, 50))
# calculate the SHAP values on the training data
# (approximate=True is passed through for speed where supported)
shap_values_int = explainer.shap_values(x_train_raw[:shap_range], approximate=True)

shap.summary_plot(np.array(shap_values_int), feature_names=feature_names_raw, plot_type='bar')
shap.summary_plot(np.array(shap_values_int), feature_names=feature_names_raw)

# plot pdp
for column in feature_names_raw:
  shap.dependence_plot(column, shap_values_int, x_train_raw[:shap_range], feature_names=feature_names_raw)
In [0]:
# force_plot for one training sample X_i
shap.initjs()
i=1
shap.force_plot(explainer.expected_value, shap_values_int[i], x_train_raw[i], feature_names=feature_names_raw)
Out[0]:
Visualization omitted, Javascript library not loaded.
In [0]:
# force_plot for all training data
shap.initjs()
shap.force_plot(explainer.expected_value, np.array(shap_values_int), x_train_raw, feature_names=feature_names_raw)
Out[0]:
Visualization omitted, Javascript library not loaded.

Our Gradient Boosting Classifier with learning_rate=0.1, max_depth=5, and n_estimators=60 yields a test accuracy of 77.42%. This strong accuracy makes us all the more curious about what is happening inside this black-box model, so we use KernelExplainer to interpret it. From the mean absolute SHAP values, we see that x15, x1, x8, x4, and x18 have a strong effect on the estimate of y, which agrees with the MLP SHAP results.


From the PDP graph of x15, once x15 falls below roughly 2.5 the SHAP values tend to be negative, meaning the model penalizes low values of x15. Practically, once someone's Months Since Most Recent Inquiry drops below this threshold, bad debt becomes more likely: inquiring too frequently is associated with a 'Bad' RiskFlag.

From the PDP graph of x1, the higher x1 is, the more likely the response is a 'Good' one. Empirically, a higher consolidated risk-marker score indicates a safer loan.

From the PDP graph of x8, the SHAP value tends to increase as x8 increases, and the watershed (threshold) effect appears again: if the Percent of Trades Never Delinquent is high enough, a 'Good' RiskFlag becomes much more likely.

The PDP graph of x4 shows more data concentrated around a SHAP value of 0 than in the three plots above, but the relationship is clearly positive. Practically, the higher someone's Average Months in File, the higher the chance of being marked 'Good', although for most applicants the model considers months in file unimportant.

The PDP graph of x18 suggests that while most data sit at a SHAP value of 0, data in the top 20% quantile of x18 are pushed to sharply lower SHAP values. This indicates that once someone's Net Fraction Revolving Burden is too high, a 'Bad' mark becomes much more likely.
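The feature ranking read off the bar summary plot is simply the mean absolute SHAP value per feature. A minimal sketch, using a small hypothetical SHAP matrix rather than the project's real values:

```python
import numpy as np

# Hypothetical SHAP matrix: 4 samples x 3 features (illustrative values only,
# not the project's actual SHAP output)
shap_values = np.array([
    [ 0.30, -0.05,  0.10],
    [-0.25,  0.02, -0.08],
    [ 0.40, -0.01,  0.12],
    [-0.35,  0.03, -0.09],
])
feature_names = ['x1', 'x15', 'x18']  # illustrative subset of feature names

# Rank features by mean |SHAP|, which is what summary_plot(..., plot_type='bar') displays
mean_abs = np.abs(shap_values).mean(axis=0)
order = np.argsort(mean_abs)[::-1]
for idx in order:
    print(f"{feature_names[idx]}: {mean_abs[idx]:.4f}")
```

Because SHAP values are signed per-sample attributions, taking the absolute value before averaging prevents positive and negative contributions from cancelling out.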

SVM (rbf) + IV Binning

In [0]:
def IntegrateSHAP(feature_names, feature_names_raw, shap_range, shap_values):
  # sum each raw feature's bin-level SHAP values back into a single per-feature value
  feature_list = list(feature_names).copy()
  shap_integrated_all = []
  for i in range(shap_range):
    shap_integrated = []
    for column in feature_names_raw:
        shap_integrating = 0
        for feature_name in feature_names:
            if feature_name.startswith(column + '_'):
                feature_index = feature_list.index(feature_name)
                shap_integrating = shap_integrating + shap_values[i][feature_index]
        shap_integrated.append(shap_integrating)
    print(shap_integrated)
    shap_integrated_all.append(shap_integrated)
  return shap_integrated_all

def PlotSHAP(model_BinAll, x_train_BinAll, feature_names_BinAll, x_train_raw, feature_names_raw, shap_range):
  # define the explainer
  explainer = shap.KernelExplainer(model_BinAll.predict, shap.sample(x_train_BinAll, 50))
  # calculate the SHAP values on the training data
  # set approximate=True for fast processing
  shap_values = explainer.shap_values(x_train_BinAll[:shap_range], approximate=True)
  print(shap_values)

  ## shap_values_int = SHAP values after integration, a np.array with shape (shap_range, 12)
  ## feature_names = np.array(['x1','x2',...,'x23'])
  shap_values_int = np.array(IntegrateSHAP(feature_names_BinAll, feature_names_raw, shap_range, shap_values))
  feature_names = feature_names_raw

  shap.summary_plot(shap_values_int, feature_names=feature_names, plot_type='bar')
  shap.summary_plot(shap_values_int, feature_names=feature_names)

  # plot pdp
  for column in feature_names_raw:
    shap.dependence_plot(column, shap_values_int, x_train_raw[:shap_range], feature_names=feature_names)

  # also return the explainer so force_plot can use its expected_value later
  return (shap_values_int, feature_names, explainer)


clf = SVC(degree=3, C=7, kernel='rbf')
clf.fit(x_train_BinAll, y_train)
(shap_values_int, feature_names, explainer) = PlotSHAP(clf, x_train_BinAll, feature_names_BinAll, x_train_raw, feature_names_raw, 10)
[Output omitted for brevity: the raw per-bin SHAP values for the first 10 training samples, followed by the 10 integrated rows of 12 per-feature SHAP values printed by the integration step.]
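The bin-to-feature integration performed above can be sanity-checked on a tiny synthetic example (the bin names and SHAP values here are illustrative, not the project's data):

```python
import numpy as np

# Hypothetical one-hot bin columns for two raw features (illustrative names only)
feature_names = ['x1_bin1', 'x1_bin2', 'x15_bin1']
feature_names_raw = ['x1', 'x15']

# SHAP values for 2 samples over the 3 bin columns
shap_values = np.array([[0.1, 0.2, -0.3],
                        [0.0, -0.1, 0.4]])

# Sum each raw feature's bin-level SHAP values, mirroring the integration step above
integrated = np.array([
    [sum(row[j] for j, name in enumerate(feature_names) if name.startswith(col + '_'))
     for col in feature_names_raw]
    for row in shap_values
])
print(integrated)  # x1 column = sum of x1_bin* attributions, x15 column = x15_bin1
```

Because SHAP attributions are additive, summing the one-hot bin columns of a feature recovers a single per-feature attribution without breaking the additivity of the explanation.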
In [0]:
# force_plot for one training sample X_i
shap.initjs()
i=1
shap.force_plot(explainer.expected_value, shap_values_int[i], x_train_raw[i], feature_names=feature_names)
In [0]:
# force_plot for all training data
shap.initjs()
shap.force_plot(explainer.expected_value, np.array(shap_values_int), x_train_raw, feature_names=feature_names)

Decision Tree + Raw

In [0]:
from sklearn import tree
import pydotplus
import graphviz

clf.best_estimator_.fit(x_train_raw, y_train)

feature_names_raw = df_test_clean.columns[1:].values

dot_data = tree.export_graphviz(clf.best_estimator_, out_file=None,
                                   feature_names=feature_names_raw,
                                   filled=True, rounded=True,
                                   special_characters=True) 

pydot_graph = pydotplus.graph_from_dot_data(dot_data)
pydot_graph.set_size('"10,7!"')
gvz_graph = graphviz.Source(pydot_graph.to_string())
gvz_graph
Out[0]:
[Decision tree diagram rendered via graphviz. Root: x1 ≤ 73.5, gini = 0.499, samples = 7896, value = [4083, 3813]; the second level splits on x15 (at 1.996 on the left branch and 2.977 on the right), with deeper splits on x1, x2, x3, x4, x5, x8, x14, x18, and x23.]

Decision Tree Interpretation

The fitted decision tree first splits on x1 (Consolidated version of risk markers) at 73.5 and then on x15 (Months Since Most Recent Inq excl 7days). The first two layers of branches divide the samples into three categories. If an applicant scores higher than 73.5 on the consolidated risk markers, the loan is very likely to be predicted a 'Good' risk flag. If the score is below 66.839 and x15 is below 1.996, the loan is very likely to be predicted a 'Bad' risk flag. The remaining factors have only limited impact, affecting the middle category located in the center of the tree plot.
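The gini value printed at each node follows the standard impurity formula gini = 1 − Σ pₖ². For example, the root node's class counts [4083, 3813] over 7,896 samples reproduce the displayed 0.499:

```python
# Gini impurity of the root node, using the class counts shown in the tree plot
counts = [4083, 3813]          # [class 0, class 1] at the root
n = sum(counts)                # 7896 samples
gini = 1 - sum((c / n) ** 2 for c in counts)
print(round(gini, 3))  # 0.499
```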

Local interpretation: the decision tree offers excellent explainability for each individual loan application, since one can easily trace the path from the root to the predicted leaf.
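This path-tracing can be done programmatically with scikit-learn's decision_path. A minimal sketch on synthetic stand-in data (the project's x_train_raw and y_train are not assumed available here):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: label depends only on the first feature
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = (X[:, 0] > 0.5).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# decision_path returns a sparse node-indicator matrix: one row per sample,
# with 1s at every node the sample passes through on its way to a leaf
path = tree.decision_path(X[:1])
node_ids = path.indices  # the exact root-to-leaf path for sample 0
print("Nodes visited by sample 0:", node_ids)
print("Predicted class:", tree.predict(X[:1])[0])
```

Each visited node's split feature and threshold (available via `tree.tree_.feature` and `tree.tree_.threshold`) then gives a human-readable rule chain for that single prediction.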

GAM + BSpline

In [0]:
# plotting
fig, axs = plt.subplots(3,4,figsize=(10,6))
names = df_test_clean.columns[1:].values
for i, ax in enumerate(axs.flatten()):
    XX = p_spl.generate_X_grid(term=i)
    plt.subplot(ax)
    plt.plot(XX[:, i], p_spl.partial_dependence(term=i, X=XX))
    plt.plot(XX[:, i], p_spl.partial_dependence(term=i, X=XX, width=.95)[1], c='r', ls='--')
    plt.title(names[i])
plt.tight_layout()
In [0]:
p_spl.summary()
LogisticGAM                                                                                               
=============================================== ==========================================================
Distribution:                      BinomialDist Effective DoF:                                     28.4902
Link Function:                        LogitLink Log Likelihood:                                 -4206.6159
Number of Samples:                         7896 AIC:                                             8470.2121
                                                AICc:                                            8470.4407
                                                UBRE:                                               3.0756
                                                Scale:                                                 1.0
                                                Pseudo R-Squared:                                   0.2308
==========================================================================================================
Feature Function                  Lambda               Rank         EDoF         P > x        Sig. Code   
================================= ==================== ============ ============ ============ ============
s(0)                              [0.004]              4            3.2          0.00e+00     ***         
s(1)                              [0.004]              4            2.3          7.22e-01                 
s(2)                              [0.004]              4            1.9          2.42e-01                 
s(3)                              [0.004]              4            1.9          7.26e-07     ***         
s(4)                              [0.004]              4            2.5          2.78e-07     ***         
s(5)                              [0.004]              4            2.5          1.96e-10     ***         
s(6)                              [0.004]              4            2.2          2.84e-01                 
s(7)                              [0.004]              4            2.7          1.52e-09     ***         
s(8)                              [0.004]              4            2.7          0.00e+00     ***         
s(9)                              [0.004]              4            2.2          2.00e-15     ***         
s(10)                             [0.004]              4            2.0          4.03e-08     ***         
s(11)                             [0.004]              4            2.5          5.34e-02     .           
intercept                                              1            0.0          1.74e-01                 
==========================================================================================================
Significance codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

WARNING: Fitting splines and a linear function to a feature introduces a model identifiability problem
         which can cause p-values to appear significant when they are not.

WARNING: p-values calculated in this manner behave correctly for un-penalized models or models with
         known smoothing parameters, but when smoothing parameters have been estimated, the p-values
         are typically lower than they should be, meaning that the tests reject the null too readily.

The partial dependence plots of the GAM agree with those of the other models.

The GAM also identifies x1, x15, x8, x18, and x4 as significant variables, and it agrees on the directions of their effects.

Therefore, we conclude that our baseline benchmark supports the interpretations of the other selected models.

Conclusion

[Image: accuracy comparison table for all candidate models]

From the accuracy table above, we find that the simple decision tree outperforms the more complex models (MLP-NN and GAM), with a high testing accuracy of 0.75 and much better explainability. Comparing the decision tree with the gradient boosting model, we find that gradient boosting is more accurate and that the loss of explainability is acceptable. We therefore choose the gradient boosting model as our final model.
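The accuracy side of this trade-off can be illustrated with a quick comparison on synthetic data (a stand-in for the loan dataset, so the numbers will differ from the report's table):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for the loan dataset
X, y = make_classification(n_samples=2000, n_features=12, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Same hyperparameters as the report's final gradient boosting model
models = {
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=0),
    'Gradient Boosting': GradientBoostingClassifier(learning_rate=0.1, max_depth=5,
                                                    n_estimators=60, random_state=0),
}
for name, model in models.items():
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: test accuracy = {acc:.4f}")
```

The ensemble typically buys a few points of accuracy at the cost of a single traceable tree, which is exactly the trade-off weighed above.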

From the post-hoc analysis, we found that the top 5 most important variables are the same across all models on our final list, indicating that their importance is robust.
